Compressed Least-Squares Regression on Sparse Spaces
Abstract
Recent advances in the area of compressed sensing suggest that it is possible to reconstruct high-dimensional sparse signals from a small number of random projections. Domains in which the sparsity assumption is applicable also offer many interesting large-scale machine learning prediction tasks. It is therefore important to study the effect of random projections as a dimensionality reduction method under such sparsity assumptions. In this paper we develop the bias–variance analysis of a least-squares regression estimator in compressed spaces when random projections are applied to sparse input signals. Leveraging the sparsity assumption, we are able to work with arbitrary non-i.i.d. sampling strategies and derive a worst-case bound on the entire space. Empirical results on synthetic and real-world datasets show how the choice of the projection size affects the performance of regression on compressed spaces, and highlight a range of problems where the method is useful.

Introduction

Modern machine learning methods have to deal with overwhelmingly large datasets, e.g. for text, sound, image and video processing, as well as for time series prediction and analysis. Much of this data contains a very high number of features or attributes, sometimes exceeding the number of labelled instances available for training. Even though learning from such data may seem hopeless, in reality the data often contains structure which can facilitate the development of learning algorithms. In this paper, we focus on a very common type of structure, in which the instances are sparse, in the sense that a very small percentage of the features in each instance is non-zero. For example, a text may be encoded as a very large feature vector (millions of dimensions), with each feature being 1 if a corresponding dictionary word is present in the text and zero otherwise. Hence, in each document, a very small number of features will be non-zero. Several algorithms have been designed to deal with this setting (which we discuss in detail at the end of the paper). Here, we focus on a new class of methods for learning in large, sparse feature sets: random projections (Davenport, Wakin, and Baraniuk 2006; Baraniuk and Wakin 2009).

Random projections originated recently in the signal processing literature (Candès and Tao 2006; Candès and Wakin 2008). The idea was motivated by the need to sample and store very large datasets (such as images and video) efficiently. The basic idea is that if the signal is generated as a linear combination of a small set of functions (chosen from a much larger set), then it can be reconstructed very well from a small, fixed number of randomized measurements. A solid theoretical foundation has been established for compressed sensing methods, showing that as the number of random measurements increases, the error in the reconstruction decreases at a nearly-optimal rate (Donoho 2006).

Compressed sampling has been studied in the context of machine learning from two points of view. One idea is to use random projections to compress the dataset, by combining training instances using random projections (see e.g. Zhou, Lafferty, and Wasserman (2007)). Such methods are useful, for instance, when the training set is too large or one has to handle privacy issues.
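As a concrete illustration of this first use of random projections, the following minimal numpy sketch compresses a training set by randomly mixing its instances and then fits least squares on the compressed data. It is written in the spirit of Zhou, Lafferty, and Wasserman (2007), not taken from that paper or from ours; all variable names and sizes are illustrative assumptions.

```python
# Illustrative sketch: compress the *dataset* (not the features) by randomly
# combining training instances, then fit least squares on the result.
import numpy as np

rng = np.random.default_rng(0)

n, D = 5000, 200      # original number of instances and number of features
m = 500               # number of compressed pseudo-instances (m << n)

# Synthetic training data with a known linear target (illustrative only).
X = rng.standard_normal((n, D))
w_true = rng.standard_normal(D)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Random matrix that compresses the dataset by mixing training instances.
R = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
X_c, y_c = R @ X, R @ y

# Ordinary least squares on the m compressed pseudo-instances.
w_hat, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)
print("parameter error:", np.linalg.norm(w_hat - w_true))
```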
Another idea is to project each input vector into a lower-dimensional space, and then train a predictor in the new compressed space (compression on the feature space). As is typical of dimensionality reduction techniques, this will reduce the variance of most predictors at the expense of introducing some bias. Random projections on the feature space, along with least-squares predictors, are studied in Maillard and Munos (2009); their analysis shows a bias–variance trade-off with respect to on-sample error bounds, which is further extended to bounds on the sampling measure, assuming an i.i.d. sampling strategy.

In this paper, we provide a bias–variance analysis of ordinary least-squares (OLS) regression in compressed spaces, referred to as COLS, when random projections are applied to sparse input feature vectors. We show that the sparsity assumption allows us to work with arbitrary non-i.i.d. sampling strategies, and we derive a worst-case bound on the entire space. The fact that we can work with non-i.i.d. data makes our results applicable to a large range of problems, including video, sound processing, music and time series data. The results allow us to make predictions about the generalization power of the random projection method outside of the training data. The bound can be used to select the optimal size of the projection, so as to minimize the sum of the expected approximation (bias) and estimation (variance) errors. It also provides the means to compare the error of linear predictors in the original and compressed spaces.

Notations and Sparsity Assumption

Throughout this paper, column vectors are represented by lower-case bold letters, and matrices are represented by bold capital letters. |·| denotes the size of a set, and ‖·‖₀ is Donoho's zero "norm", indicating the number of non-zero elements in a vector. ‖·‖ denotes the L2 norm for vectors and the operator norm for matrices: ‖M‖ = sup_v ‖Mv‖ / ‖v‖. We denote the Moore–Penrose pseudo-inverse of a matrix M by M† and the smallest singular value of M by σ_min(M).

We will be working in sparse input spaces for our prediction task. Our input is represented by a vector x ∈ X of D features, having ‖x‖ ≤ 1. We assume that x is k-sparse in some known or unknown basis Ψ, implying that X ≜ {Ψz : ‖z‖₀ ≤ k and ‖z‖ ≤ 1}. For a concrete example, the signals can be natural images and Ψ can represent these signals in the frequency domain (e.g., see Olshausen, Sallee, and Lewicki (2001)). The on-sample error of a regressor is the expected error when the input is drawn from the empirical distribution (i.e., the expected error when the input is chosen uniformly from the training set), and the off-sample error is the error under a measure other than the empirical one.

Random Projections and Inner Product

It is well known that random projections of appropriate sizes preserve enough information for exact reconstruction with high probability (see e.g. Davenport, Wakin, and Baraniuk (2006); Candès and Wakin (2008)). In this section, we show that a function that is (almost-)linear in the original space is almost linear in the projected space, when we have random projections of appropriate sizes.

There are several types of random projection matrices that can be used. In this work, we assume that each entry of a projection Φ_{D×d} is an i.i.d. sample from a Gaussian:

φ_{i,j} ∼ N(0, 1/d).  (1)
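As a concrete illustration, the following minimal sketch (not the authors' implementation) draws a projection matrix with entries distributed as in Eqn 1, projects k-sparse inputs into d dimensions, and fits ordinary least squares in the compressed space, i.e., the COLS procedure discussed above. The sparse-vector generator, the dimensions and the target function are illustrative assumptions.

```python
# Illustrative COLS sketch: random Gaussian projection of k-sparse inputs,
# followed by ordinary least squares in the d-dimensional compressed space.
import numpy as np

rng = np.random.default_rng(0)
D, d, k, n = 2000, 100, 10, 500   # ambient dim, projected dim, sparsity, samples

def sample_sparse(n, D, k, rng):
    """Draw n unit-norm vectors with at most k non-zero coordinates."""
    X = np.zeros((n, D))
    for i in range(n):
        idx = rng.choice(D, size=k, replace=False)
        v = rng.standard_normal(k)
        X[i, idx] = v / np.linalg.norm(v)
    return X

X = sample_sparse(n, D, k, rng)                    # k-sparse training inputs
w = rng.standard_normal(D)
w /= np.linalg.norm(w)                             # target weights with ‖w‖ ≤ 1
y = X @ w + 0.01 * rng.standard_normal(n)          # noisy (near-)linear targets

Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(D, d))   # φ_ij ~ N(0, 1/d), Eqn 1

X_proj = X @ Phi                                   # compressed inputs Φᵀx
w_cols, *_ = np.linalg.lstsq(X_proj, y, rcond=None)

x_test = sample_sparse(1, D, k, rng)
print("COLS prediction:", float(x_test @ Phi @ w_cols),
      "  true value:", float(x_test @ w))
```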
We build our work on the following result (based on Theorem 4.1 of Davenport, Wakin, and Baraniuk (2006)), which shows that for a finite set of points, the inner product with a fixed vector is almost preserved after a random projection.

Theorem 1. (Davenport, Wakin, and Baraniuk (2006)) Let Φ_{D×d} be a random projection according to Eqn 1, and let S be a finite set of points in ℝ^D. Then for any fixed w ∈ ℝ^D and ε > 0, the event

∀s ∈ S : |⟨Φᵀw, Φᵀs⟩ − ⟨w, s⟩| ≤ ε ‖w‖ ‖s‖,  (2)

fails with probability less than (4|S| + 2) e^{−dε²/48}.

We derive the corresponding theorem for sparse feature spaces. The elements of the projection are typically taken to be distributed as N(0, 1/D), but we scale them by √(D/d), so that we avoid scaling the projected values (see e.g. Davenport, Wakin, and Baraniuk (2006)).

Theorem 2. Let Φ_{D×d} be a random projection according to Eqn 1, and let X be a D-dimensional k-sparse space. Then for any fixed w and ε > 0, the event

∀x ∈ X : |⟨Φᵀw, Φᵀx⟩ − ⟨w, x⟩| ≤ ε ‖w‖ ‖x‖,  (3)

fails with probability less than

(eD/k)^k [4(12/ε)^k + 2] e^{−dε²/192} ≤ e^{k log(12eD/(εk)) − dε²/192 + log 6}.

The proof is attached in the appendix. Note that the above theorem does not require w to be in the sparse space, and is thus different from guarantees on the preservation of the inner product between two vectors in a sparse space.

Bias–Variance Analysis of Ordinary Least-Squares

In this section, we analyze the worst-case prediction error of the OLS solution. We then proceed to the main result of this paper, which is the bias–variance analysis of OLS in the projected space.

We seek to predict a signal f that is assumed to be a (near-)linear function of x ∈ X:

f(x) = xᵀw + b_f(x), where |b_f(x)| ≤ ε_f,  (4)

for some ε_f > 0, and where we assume ‖w‖ ≤ 1. We are given a training set of n input–output pairs, consisting of a full-rank input matrix X_{n×D} along with noisy observations of f:

y = Xw + b_f + η,  (5)

where, for the additive bias term (overloading the notation), b_{f,i} = b_f(x_i), and we assume the homoscedastic noise term η to be a vector of i.i.d. random variables distributed as N(0, σ_η²). Given the above, we seek a predictor that for any query x ∈ X predicts the target signal f(x).

The following lemma provides a bound on the worst-case error of the ordinary least-squares predictor. This lemma is in part a classical result in linear prediction theory and is given here with a proof mainly for completeness.

Lemma 3. Let w_ols be the OLS solution of Eqn 5 with additive bias bounded by ε_f and i.i.d. noise of variance σ_η². Then for any 0 < δ_var ≤ √(2/(eπ)) and any x ∈ X, with probability no less than 1 − δ_var the error in the OLS prediction satisfies the following bound:
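The inner-product guarantee behind Theorem 2 can be checked empirically with a short simulation, independent of the regression analysis above: for a fixed w and randomly drawn k-sparse unit vectors x, the projected inner product ⟨Φᵀw, Φᵀx⟩ stays close to ⟨w, x⟩, and the worst observed distortion shrinks as the projection size d grows. The sketch below is illustrative only; the dimensions and number of trials are assumptions, not values from the paper.

```python
# Illustrative check of inner-product preservation for k-sparse vectors
# under the Gaussian projection of Eqn 1, for several projection sizes d.
import numpy as np

rng = np.random.default_rng(1)
D, k, trials = 2000, 10, 200

w = rng.standard_normal(D)
w /= np.linalg.norm(w)                    # fixed vector with ‖w‖ = 1

def sparse_unit(D, k, rng):
    """Draw one unit-norm vector with at most k non-zero coordinates."""
    x = np.zeros(D)
    idx = rng.choice(D, size=k, replace=False)
    v = rng.standard_normal(k)
    x[idx] = v / np.linalg.norm(v)
    return x

for d in (50, 200, 800):
    Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(D, d))   # φ_ij ~ N(0, 1/d)
    errs = []
    for _ in range(trials):
        x = sparse_unit(D, k, rng)
        errs.append(abs((Phi.T @ w) @ (Phi.T @ x) - w @ x))
    print(f"d={d:4d}  max |<Phi^T w, Phi^T x> - <w, x>| over {trials} draws: {max(errs):.3f}")
```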